Pandas and NumPy

A pandas Series is a table with one column, a row index, and a name.

Series

  1. Pandas automatically identifies the type (dtype) of a Series; we can see it just by printing the Series out
  2. s = pd.Series()
  3. s
  4. Mixed types are not a problem for pandas.Series, as it converts them all to the object type
  5. We can access a particular value in a Series using loc or iloc, which select a value by its index label or by its integer position, respectively
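
The points above can be sketched with a small example. The values, labels, and variable names here are made up for illustration and are not from any dataset used in these notes.

```python
import pandas as pd

# Hypothetical sample data, chosen only to illustrate the points above.
numbers = pd.Series([10, 20, 30], index=["a", "b", "c"], name="numbers")
mixed = pd.Series([1, "two", 3.0], name="mixed")

print(numbers.dtype)  # an integer dtype, inferred from the values
print(mixed.dtype)    # object, because the values have mixed types

by_label = numbers.loc["b"]     # loc selects by index label
by_position = numbers.iloc[1]   # iloc selects by integer position
print(by_label, by_position)
```

Both lookups return the same element here, because the label "b" sits at position 1.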

Vectorization

  1. Pandas is also very fast at doing computations over a Series.
  2. For example, compare summing a Series of numbers with a for loop against doing it with np.sum(s).
  3. The time difference is huge.
  4. This can also be done for other types of data.
  5. This is because of vectorization and parallel programming (more on this).
It is very important to know which function to use where for faster computation.
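
A rough way to see the difference is to time both approaches on the same Series. This is only a sketch: the Series size is an arbitrary choice, and the actual timings depend on the machine, so no numbers are claimed here.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical data: the integers 0..99999 (the size is an arbitrary choice).
s = pd.Series(np.arange(100_000, dtype=np.int64))

# Plain Python loop over the Series.
start = time.perf_counter()
loop_total = 0
for value in s:
    loop_total += value
loop_seconds = time.perf_counter() - start

# Vectorized sum with NumPy.
start = time.perf_counter()
vector_total = np.sum(s)
vector_seconds = time.perf_counter() - start

print(f"for loop: {loop_seconds:.6f}s, np.sum: {vector_seconds:.6f}s")
```

Both approaches give the same total; only the time taken differs.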

DataFrame

  1. A DataFrame is equivalent to a table in a database, or a collection of Series.
  2. Just like a Series, it has an index. Since there are multiple Series, it has multiple Series names, which are called column names

| Row Name/Column Name | Name 1 | Name 2 | Name 3 |
|---|---|---|---|
| 0 | V1 | V2 | V3 |
| 1 | V1 | V2 | V3 |

  1. Just like with a Series, the values can be accessed using loc and iloc
  2. Both take a row and a column, or a list of rows and a list of columns, by label (loc) or by integer position (iloc)
  3. Adding a column to a DataFrame is as easy as assigning values to a new column name
  4. We can create a DataFrame using the pandas.DataFrame() constructor, which takes an iterable object.
  5. This iterable can be a list or an index
import pandas as pd

lis = [1, 2, 3, 4, 5, 6]
s = pd.Series(lis, name='A', index=['Zero', 'One', 'Two', 'Three', 'Four', 'Five'])
df = pd.DataFrame(lis, columns=['A'], index=['Zero', 'One', 'Two', 'Three', 'Four', 'Five'])
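
As a minimal sketch of the access and column-assignment points, the same DataFrame can be extended and queried as follows. The column name "B" and the selected rows are made up for illustration.

```python
import pandas as pd

labels = ["Zero", "One", "Two", "Three", "Four", "Five"]
df = pd.DataFrame([1, 2, 3, 4, 5, 6], columns=["A"], index=labels)

# Adding a column is a plain assignment; "B" is a hypothetical column name.
df["B"] = df["A"] * 10

single = df.loc["Two", "A"]                    # one row label, one column label
block = df.loc[["Zero", "Five"], ["A", "B"]]   # lists of labels
same_single = df.iloc[2, 0]                    # row position 2, column position 0
same_block = df.iloc[[0, 5], [0, 1]]           # lists of positions

print(block)
```

Here loc and iloc return the same cells, because the chosen labels and positions point at the same rows and columns.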

> Usually we work on datasets that we convert to DataFrames in order to analyze them. These usually come as CSV, Excel, or text files, and we need to make sense of them.

o = pd.read_csv('olympics.csv')
o.head()

We can see that we have an unwanted index and unwanted column names. What we actually need is the first data row as the column names and the first column as the index.
We can do this by taking advantage of the skiprows and index_col parameters of **read_csv()**

o = pd.read_csv('olympics.csv', skiprows=1, index_col=0)
o.head()
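
To see what these two parameters do without the original file, here is a minimal sketch that reads made-up CSV text from memory. The countries and numbers are invented; only the shape (a junk first line, then the real header) is meant to resemble the situation described above.

```python
import io

import pandas as pd

# Made-up CSV text: the first line is a junk header, the second holds the real names.
raw = io.StringIO(
    "0,1,2\n"
    "Country,Gold,Silver\n"
    "Aland,1,2\n"
    "Borduria,3,4\n"
)

# skiprows=1 drops the junk line; index_col=0 uses the first column as the index.
table = pd.read_csv(raw, skiprows=1, index_col=0)
print(table)
```

After this call the column names come from the second line of the text, and the rows are labelled by country.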

> This gives a much better result, but not entirely. We can still see that some column names do not make sense or are ambiguous. Let's do a little more formatting.

for col in o.columns:
    if col[:2] == '01': o.rename(columns={col: 'Gold' + col[5:]}, inplace=True)
    if col[:2] == '02': o.rename(columns={col: 'Silver' + col[5:]}, inplace=True)
    if col[:2] == '03': o.rename(columns={col: 'Bronze' + col[5:]}, inplace=True)
    if col[:1] == '№': o.rename(columns={col: '#' + col[2:]}, inplace=True)
o.head()
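
| R/C | #Summer | Gold | Silver | Bronze | Total | #Winter | Gold1 | Silver1 | Bronze1 | Total.1 | #Games | Gold2 | Silver2 | Bronze2 | Combined total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan (AFG) | 13 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 2 | 2 |
| Algeria (ALG) | 12 | 5 | 2 | 8 | 15 | 3 | 0 | 0 | 0 | 0 | 15 | 5 | 2 | 8 | 15 |
| Argentina (ARG) | 23 | 18 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 |
| Armenia (ARM) | 5 | 1 | 2 | 9 | 12 | 6 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 9 | 12 |
| Australasia (ANZ) [ANZ] | 2 | 3 | 4 | 5 | 12 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 4 | 5 | 12 |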

This is much better and more understandable. Now we can use it for further analysis.
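
As an alternative to renaming inside a loop, rename() also accepts a function that is applied to every column name. The sketch below uses the same slicing as the loop above, but the one-row frame and its raw column names ("№ Summer", "01 !", and so on) are assumptions made for illustration, as is the helper name clean_name.

```python
import pandas as pd

# Hypothetical frame; the raw column names are assumed, not taken from the real file.
medals = pd.DataFrame(
    [[13, 0, 0, 2]],
    columns=["№ Summer", "01 !", "02 !", "03 !"],
    index=["Afghanistan (AFG)"],
)

prefixes = {"01": "Gold", "02": "Silver", "03": "Bronze"}


def clean_name(col):
    """Return a readable column name, using the same slices as the loop above."""
    if col[:2] in prefixes:
        return prefixes[col[:2]] + col[5:]
    if col[:1] == "№":
        return "#" + col[2:]
    return col


# rename() returns a new DataFrame; the original is left unchanged.
renamed = medals.rename(columns=clean_name)
print(list(renamed.columns))
```

This does the renaming in a single call instead of one call per column.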

Boolean Masking

  1. Boolean masking is one way to query our DataFrame
o['Silver'] >= 5
Afghanistan (AFG)          False
Algeria (ALG)              False
Argentina (ARG)             True
Armenia (ARM)              False
Australasia (ANZ) [ANZ]    False
Name: Silver, dtype: bool

For example, the expression above flags all the countries that have won 5 or more silver medals.
The comparison is broadcast to every value in the o['Silver'] Series and returns a boolean Series as output.
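
The boolean Series can then be used to select the matching rows. Here is a minimal sketch on a small hand-built frame; the Silver counts mirror the rows shown above, but the frame itself and the names mask and winners are made up for illustration.

```python
import pandas as pd

# Small hypothetical frame with only the Silver column.
medals = pd.DataFrame(
    {"Silver": [0, 2, 24, 2, 4]},
    index=[
        "Afghanistan (AFG)",
        "Algeria (ALG)",
        "Argentina (ARG)",
        "Armenia (ARM)",
        "Australasia (ANZ) [ANZ]",
    ],
)

mask = medals["Silver"] >= 5   # boolean Series, one value per row
winners = medals[mask]         # keeps only the rows where the mask is True
print(winners)
```

Indexing the DataFrame with the mask keeps only the rows where the condition holds.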